ABSTRACT

The kinematics, dynamics, Jacobian, and their corresponding inverse computations are six essential problems 
in the control of robot manipulators. Efficient parallel algorithms for these computations are discussed and 
analyzed. Their characteristics are identified and a scheme on the mapping of these algorithms to a reconfigurable 
parallel architecture is presented. Based on the characteristics including type of parallelism, degree of parallelism, 
uniformity of the operations, fundamental operations, data dependencies, and communication requirement, it is 
shown that most of the algorithms for robotic computations possess highly regular properties and some common 
structures, especially the linear recursive structure. Moreover, they are well-suited to be implemented on a single- 
instruction-stream multiple-data-stream (SIMD) computer with reconfigurable interconnection network. The model 
of a reconfigurable dual network SIMD machine with internal direct feedback is introduced. A systematic pro- 
cedure to map these computations to the proposed machine is presented. A new scheduling problem for SIMD 
machines is investigated and a heuristic algorithm, called neighborhood scheduling, that reorders the processing 
sequence of subtasks to reduce the communication time is described. Mapping results of a benchmark algorithm are 
illustrated and discussed. 

1. Introduction 

Robot manipulators are highly nonlinear systems and their dynamic performance is directly dependent on the 
efficiency of the kinematic and dynamic models, the control schemes/algorithms, and the computer architecture for 
computing the control schemes. In general, robot manipulators are usually servoed in the joint-variable space while 
the objects to be manipulated are usually expressed in the world (or Cartesian) coordinate system. In order to con- 
trol the position and orientation of the manipulator end-effector, the robot controller is required to compute, at a 
sufficient rate, such tasks as coordinate transformation between the joint-variable space and the Cartesian space, 
generalized forces/torques to drive the joint motors, the manipulator inertia matrix for model-based control schemes, 
and the Jacobian matrix which relates the joint velocity in the joint-variable space to the Cartesian space. These are 
the basic robotic computations for the control of robot manipulators. They are equivalent to the computations of 
kinematics, dynamics, Jacobian, and their corresponding inverses. These six basic robotics computations are 
required at various stages of robot arm control and computer simulation of robot motion, and reveal a basic
characteristic and common problem in robot manipulator control: intensive computations with a high level of data
dependency. They have become major computational bottlenecks in the control of robot manipulators. Despite 
their impressive speed, conventional general-purpose uniprocessor computers cannot efficiently handle the kinemat- 
ics and dynamics computations at the required computation rate because their architectures limit them to a mostly 
serial approach to computation. Furthermore, less efficient, serial computational algorithms must be used to com- 
pute these robotics computations on a uniprocessor computer. Consequently, the quest for real-time robot arm con- 
trol and motion simulation rests on the study and development of parallel algorithms of lower computational com- 
plexity with faster computational structures. The ultimate goal is to achieve an order-of-magnitude and/or an 
order-of-complexity improvement in computational efficiency in these robotics computations by taking advantage of 
parallelism, pipelining, and architectures. 

A common feature of today’s research on robotic computational problems is that a specific problem, mostly 
the inverse dynamics or the inverse kinematics, is studied at a time, and usually an algorithmically-specialized 
architecture or processor is developed for that particular algorithm. Obviously, this specialized architecture can 
make the most use of the parallel properties of the algorithm. However, most advanced robot control schemes
require solving a combination of some or all of the six basic robotic computations. One solution for this
problem is to wire these specialized architectures or processors together. This method is inflexible because the com- 
bination of these components is dedicated to a particular control scheme and cannot be used efficiently for another 
scheme. Another solution is to connect the architectures or processors to a bus as peripherals of a general-purpose 
computer. This is more flexible, but the bus becomes a bottleneck and time is wasted in data movements between 
different computational processes. Another possible solution is focussed on partitioning the original algorithm/task 
into a set of subtasks with precedence relationship and then developing efficient scheduling algorithms to map these 
subtasks onto a general-purpose multiprocessor system. This solution is much more flexible because most
computational algorithms can be represented by directed task graphs. However, this approach may result in ignoring some
inherent parallelism in robotics algorithms. 

In this paper, we shall address these robotic computational problems, and major effort is focussed on finding a 
scheme which provides the flexibility needed to solve robotic computational problems on the same architecture 
while maintaining high efficiency by taking into account the inherent parallelism of robotics algorithms. To exploit 
the inherent parallelism of these robotics algorithms, our approach is first to characterize the set of parallel robotic 
algorithms based on the six specified characteristics and features, including type of parallelism, degree of parallel- 
ism, uniformity of the operations, fundamental operations, data dependency, and communication requirement. Our 
analysis shows that machines operating in the single-instruction-stream multiple-data-stream (SIMD) mode are the 
most efficient and suitable for our robotic algorithms. By fully considering the common characteristics and inherent 
parallelism of the robotics algorithms, a prototype of a medium-grained, reconfigurable, dual-network, SIMD 
machine with internal direct feedback has been designed for the computation of these kinematic and dynamic com- 
putational tasks. A systematic mapping procedure has been developed for scheduling these robotic computational 
tasks onto the proposed SIMD machine. This procedure builds a task table which contains the subtask assignment 
from the original parallel algorithm. Then a simplified task table and an input table are produced through the nota- 
tion simplification. These two tables are then used as inputs to the neighborhood scheduling algorithm which reord- 
ers the processing sequence of the subtasks into a rescheduled task table to reduce the communication time. Finally, 
the subtasks in this rescheduled task table are mapped onto the proposed SIMD machine and a control table which 
describes the control sequence in the machine is produced. A benchmark algorithm which contains the characteris- 
tics of the six basic robotic computations has been implemented on the proposed SIMD machine, and the mapping 
results are included for discussion.

2. Characteristics of Parallel Algorithms 

A key factor to the design of a parallel architecture for a group of algorithms is the understanding of their 
architectural requirements, and this requires us to identify the characteristics of these algorithms. This identification 
is usually helpful because the algorithms from a given application area such as robotics often possess an identifiable 
structure. In order to examine the characteristics of the six basic robotics parallel algorithms, a set of features which 
have the greatest effects on the execution of parallel algorithms is defined for robotics application [1]. 

■ Type of parallelism. Two levels of parallelism can be identified. 

(a) Job-level parallelism. The original algorithm is reformulated to a parallel processable form. In this level, 
the variables carrying the same kind of information but with different indices (e.g., for different links or 
joints of a manipulator) are processed in parallel. Due to the nature of the robot’s serial link structure,
variables representing the same physical meaning are defined for each link, such as joint velocities, joint
accelerations, and joint torques. Usually, the same class of variables is produced through an identical
computational procedure but with a different set of data. This property is called uniformity of operations as
defined below. So the job-level parallelism will often be amenable to the SIMD implementation and usu- 
ally the required number of processors depends on the number of degrees of freedom of the manipulator 
(i.e., one processor for each joint). 

(b) Task-level parallelism. The original algorithm is decomposed into multiple subtasks. While the computa- 
tion within a subtask is serial, the number of subtasks that can be processed concurrently is maximized by 
using some scheduling techniques. Obviously, this implies multiple-instruction-stream multiple-data- 
stream (MIMD) operations. Furthermore, for this level of parallelism, a subtask usually performs the same 
computation for different sets of data, and hence the operation can be pipelined. An advantage of this
task-level parallelism is that the required number of processors is independent of the number of degrees of
freedom of the manipulator.

■ Degree of parallelism (Granularity). Three levels of granularity are distinguished. In the large grain granular- 
ity, the parallelism is performed at the algorithmic level. That is, only the parallelism between different seg- 
ments or subtasks is considered. For the medium grain granularity, the concurrency is considered at the opera- 
tion level and the parallelism is performed based on some basic mathematical operations such as vector cross 
product and matrix-vector multiplication. If we consider the implementation of parallelism within the basic 
arithmetic operations, then the fine grain granularity is achieved. Different degrees of parallelism often imply 
different synchronization requirements. The finer the granularity is, the more frequent synchronization is 
required. 

■ Uniformity of operations. A robotics algorithm is said to possess uniformity of operations if the required com- 
putations for some set of variables, especially the joint variables, are uniform. An algorithm with operation uni- 
formity can be implemented on an SIMD machine with higher efficiency. 




■ Fundamental operations. Algorithms in an application area usually perform similar mathematical operations. 
The identification of the basic operations performed in the algorithm will dictate the processor capabilities needed.

■ Data dependency. Three kinds of data dependency are classified for robotics algorithms: local neighborhood
dependency, special type dependency, and global dependency. The local dependency means that the required
operands in an operation come from its neighborhood; for example, from the results of the last operation or using
the same operands as the last operation. The special type dependency is defined for some special equation or
problem. There are some special types of data dependency that are peculiar and inherent to the robotics algorithms.
Among them, the homogeneous (or hetero-homogeneous) linear recursive type of dependency, which describes
the data dependency in a homogeneous (or hetero-homogeneous) linear recursive equation, appears most
frequently. This linear recurrence structure plays a major role in the robotics algorithms because the variables of a
joint are usually related to the corresponding variables of its adjacent joint due to the robot’s serial link structure.
Other special types of data dependency are defined for some well-known problems; for example, a system of
linear equations and the Column-Sweeping algorithm for a triangular linear system. The global dependency means
that the results of some operations may be required by other operations or equations that may appear in other
places of the algorithm. Since few algorithms possess absolutely one kind of data dependency, we can just
identify whether an algorithm is local data dependency oriented or not. The data dependency in an algorithm usually
dictates memory organization, data allocation, and communication requirements.

■ Communication requirement. The communication requirement decides the required interconnection type 
between processor and processor or between processor and memory. Three types of interconnection are con- 
sidered: one-to-one connection, permutation and broadcast connections. Of course, the exact required inter- 
connection type for each computation in an algorithm depends on many factors such as task alignment of each 
processor, data allocation in the memories, and data dependency of each computation. Hence, the exact 
required interconnection type can only be decided at the time of the algorithm-architecture mapping process. In 
examining the features of robotics parallel algorithms, only rough connection requirements can be observed. 

3. Characterization of Basic Robotics Parallel Algorithms 

Based on the above set of features, each of the six basic robotics algorithms has been carefully examined
and analyzed to find the common features and characteristics among them [2]. Only the final results are presented
here, which are useful for better understanding of the robotics computations and for designing a suitable parallel 
architecture for their computations. 

Inverse Dynamics Problem. Among various methods for computing the inverse dynamics problem, the one based
on the Newton-Euler (NE) equations of motion is the most efficient [3]. Since this method has been shown to possess
the time lower bound of $O(n)$ running on uniprocessor computers, where n is the number of degrees of
freedom of the manipulator, further substantial improvements in computational efficiency appear unlikely.
Nevertheless, some improvements could be achieved by taking advantage of particular computation 
customized algorithms/architectures for specific manipulators [5], parallel computations [6,7], and scheduling
algorithms for multiprocessor systems [8-11].

Forward Dynamics Problem. Among various methods for solving the forward dynamics problem [12-14], the 
composite rigid-body method [12], based on the computation of the NE equations of motion, is widely used to 
develop efficient parallel algorithms [14-16]. The composite rigid-body method is suitable for parallel processing
because efficient parallel algorithms for the inverse dynamics computation have been well developed and can be
used to speed up the computation.

Forward Kinematics Problem. Using the Denavit-Hartenberg matrix representation for establishing the link coordinate
frames [17,18], the solution to the forward kinematics problem is the successive multiplication of the 4×4
homogeneous link transformation matrices for an n-link manipulator

$$T = A_0^1 A_1^2 A_2^3 \cdots A_{i-1}^{i} \cdots A_{n-1}^{n} \qquad (1)$$

where $A_{i-1}^{i}$ is the D-H link transformation matrix which relates the ith coordinate frame to the (i-1)th coordinate
frame [17,18]. The above successive matrix multiplication equation can be reformulated in a homogeneous linear
recursive form

$$T_1 = A_0^1 \quad \text{and} \quad T_i = T_{i-1} A_{i-1}^{i} \quad \text{for } i = 2, \ldots, n , \qquad (2)$$

from which the configuration of all the coordinate frames can be obtained at the time lower bound [7,19,20].
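
The recursion in Eq. (2) can be sketched serially as follows. This is only a minimal illustration, assuming the standard D-H parameter convention (theta, d, a, alpha) for each link; the function names `dh_matrix` and `forward_kinematics` and the two-link planar arm are hypothetical and are not taken from the algorithms cited above.

```python
import math

def dh_matrix(theta, d, a, alpha):
    """4x4 D-H link transformation matrix (standard convention, assumed here)."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha), math.sin(alpha)
    return [
        [ct, -st * ca, st * sa, a * ct],
        [st, ct * ca, -ct * sa, a * st],
        [0.0, sa, ca, d],
        [0.0, 0.0, 0.0, 1.0],
    ]

def matmul4(X, Y):
    """Plain 4x4 matrix product."""
    return [[sum(X[r][k] * Y[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def forward_kinematics(links):
    """Return [T_1, ..., T_n], where T_1 = A_1 and T_i = T_{i-1} A_i as in Eq. (2)."""
    frames = []
    for params in links:
        A = dh_matrix(*params)
        frames.append(A if not frames else matmul4(frames[-1], A))
    return frames

# Hypothetical two-link planar arm: unit link lengths, joint angles +90 and -90 degrees.
links = [(math.pi / 2, 0.0, 1.0, 0.0), (-math.pi / 2, 0.0, 1.0, 0.0)]
frames = forward_kinematics(links)
end_effector = (frames[-1][0][3], frames[-1][1][3])  # x, y of the last frame origin
```

Every intermediate frame is kept because the other robotics computations (for example, the Jacobian) need the configuration of all coordinate frames, not only the last one.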

Forward Jacobian Problem. Existing methods for computing the Jacobian are mostly confined to uniprocessor
computers. In particular, Orin/Schrader [21] and Yeung/Lee [22] exploited the linear recurrence characteristics of
the Jacobian equations. These methods differ from each other only in the selection of the reference
coordinate frame for computation. The reference coordinate frame is selected such that all the vectors and matrices
and the Jacobian computed are referred to that reference coordinate system. They all have the computational order
of $O(n)$ for an n-jointed manipulator.

Inverse Jacobian Problem. The inverse Jacobian algorithms for a general manipulator can be divided into two 
categories. One is to calculate the inverse or the generalized inverse Jacobian explicitly [23]. The other is to
consider the inverse Jacobian problem as a system of linear equations and solve for the joint rate from the Cartesian
velocity implicitly [24]. For practical purposes, the latter approach is easier to parallelize because standard
techniques for solving a system of linear equations, such as the Gaussian elimination method, can be used.
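
As a rough illustration of the implicit approach, the joint rates can be obtained by solving the linear system J q̇ = v with Gaussian elimination instead of forming the inverse Jacobian. The sketch below is serial and assumes a square, nonsingular Jacobian; the helper name `solve_joint_rates` and the 2×2 numerical values are hypothetical.

```python
def solve_joint_rates(J, v):
    """Solve J * qdot = v by Gaussian elimination with partial pivoting.

    Assumes J is square and nonsingular (an assumption of this sketch).
    """
    n = len(J)
    # Augmented matrix [J | v]
    M = [[float(x) for x in J[r]] + [float(v[r])] for r in range(n)]
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("Jacobian is singular or nearly singular")
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate the entries below the pivot
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    # Back substitution on the resulting upper-triangular system
    qdot = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * qdot[c] for c in range(r + 1, n))
        qdot[r] = (M[r][n] - s) / M[r][r]
    return qdot

J = [[-1.0, 0.0], [1.0, 1.0]]  # hypothetical 2x2 Jacobian
v = [0.5, 0.25]                # hypothetical Cartesian velocity
qdot = solve_joint_rates(J, v)
```

The elimination of the rows below each pivot is the part that applies the same operation to different data, which is why this formulation lends itself to parallel execution.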

Inverse Kinematics Problem. In general, the inverse kinematic position solution can be obtained by various tech- 
niques [18], among which the inverse transform [25] and the iterative method [26] are widely discussed. The 
inverse transform technique yields a set of explicit, non-iterative joint angle equations which involve
multiplications, additions, square root, and transcendental function operations. The iterative methods can obtain
robot-independent joint solutions, but they usually have some disadvantages: more computations than the closed-form
solution, variable computation time and, more importantly, convergence problems, especially in the singular and
degenerate cases. We shall examine the characteristics of the inverse transform technique and the iterative methods.

The equations for closed-form solution appear highly non-uniform [27]. To achieve higher parallelism for the 
inverse kinematics problem, the iterative method provides a better approach, since nearly every presented iterative 
method contains the computations of forward kinematics, forward Jacobian, and inverse Jacobian [26], which have 
been shown to be highly parallelizable.

If we consider these six basic robotics computations as a set of tasks that we need to compute for the control 
of robot manipulators, then we need to find their common features and characteristics so that a parallel architecture 
can be designed to efficiently compute these tasks. The characteristics of the six basic robotics algorithms are tabu- 
lated in Table 1 and it shows that these algorithms do possess some important common features and characteristics. 
This is especially true for the inverse dynamics, the forward dynamics, the forward kinematics, and the forward 
Jacobian computations for the following three reasons. First, they are all suitable to be parallelized at the job-level 
and the parallelization can be performed at the large, medium, and fine grain granularities simultaneously, although 
different granularities are emphasized in each individual algorithm. Second, their operations are all uniform for the 
variables corresponding to each joint, and the most important fundamental operation is the matrix-vector operation. 
Finally, the strongest common feature is that they are all in homogeneous linear recursive form, for which the
recursive doubling technique can be applied to achieve the time lower bound of $O(\lceil \log_2 n \rceil)$. The communication
requirement indicates that one-to-one and some regular or irregular permutation capabilities are required for these 
four computational problems and the broadcast capability is necessary for the forward dynamics and the forward 
Jacobian algorithms. This indicates that some efficient, versatile network is required in the parallel architecture for 
their computations. 
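
The recursive doubling idea can be sketched for a homogeneous linear recurrence of the form $T_i = T_{i-1} A_i$ as follows. In this sketch each list comprehension stands in for one lock-step operation across n processors, so the quantity of interest is the number of outer steps; the function names are illustrative, and small integer 2×2 matrices stand in for the link matrices.

```python
import math

def recursive_doubling(A, combine):
    """Compute all prefix products P[i] = A[0] * A[1] * ... * A[i].

    Uses ceil(log2 n) combining steps; `combine` must be associative
    (it need not be commutative, since operand order is preserved).
    """
    P = list(A)
    n = len(P)
    steps = 0
    stride = 1
    while stride < n:
        # One parallel step: "processor" i combines the partial product held
        # `stride` positions to its left with its own partial product.
        P = [combine(P[i - stride], P[i]) if i >= stride else P[i]
             for i in range(n)]
        stride *= 2
        steps += 1
    return P, steps

def matmul2(X, Y):
    """Plain 2x2 matrix product."""
    return [[sum(X[r][k] * Y[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

# Six hypothetical non-commuting 2x2 integer matrices
A = [[[i, 1], [1, 0]] for i in range(1, 7)]
prefix, steps = recursive_doubling(A, matmul2)

# Serial reference: n - 1 dependent multiplications
serial = [A[0]]
for M in A[1:]:
    serial.append(matmul2(serial[-1], M))
```

For n = 6 the loop runs with strides 1, 2, and 4, that is, three parallel steps instead of five dependent serial multiplications.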

The inverse Jacobian and the inverse kinematics computations may seem to have less in common with the above four
algorithms. However, if less efficient methods to solve these two problems are chosen individually, then these two
algorithms may share some common features with the other four algorithms, and a common parallel architecture can be
designed to match all these common characteristics for their computations. From previous discussions, we found
that either the direct method or the iterative method for the inverse Jacobian is a proper candidate for parallel pro- 
cessing, while the direct method is more efficient with somewhat complex data dependencies. For the inverse 
kinematics problem, only the iterative method possesses regular properties similar to the other four computations. 

With all the characteristics listed in Table 1, we shall next examine how to reformulate and parallelize these 
robotics algorithms from their original serial algorithms by complying with their common features [2]. The
parallelization process is performed at the job level; that is, we try to express the original algorithms as a sequence of serial
steps (jobs). Each individual step is accomplished through the cooperation of all the processors and for each step, 
the operations of each processor are almost identical by using one of their common features: the uniformity of 
operations. Hence, each step can be considered to be a single instruction in a serial program. Two different steps 
(or jobs) are identified after the parallelization process: single steps and macro steps. The notion of “single instruc- 
tion” and “subroutine” of a serial program can be used to distinguish between these two steps. A single step 
corresponds to a single instruction in a serial program, while a macro step corresponds to a subroutine in a serial 
program. The macro steps require more complex parallel computations for all the processors, for example, the 
homogeneous linear recursive equation, the hetero-homogeneous linear recursive equation, and the system of linear 
equations are all macro steps. These macro steps are identified by their completeness and repetitiveness. Completeness
means that the step can be treated as an individual problem. The technique to process these macro steps in
parallel needs special consideration, and the algorithm to solve these steps is so well-structured that finer decomposition
is not helpful or is even impossible; examples are the parallel recursive doubling technique for solving the homogeneous
linear recursive equation, and the parallel Cholesky factorization technique for solving a system of linear
equations with a symmetric positive-definite square matrix. Repetitiveness means that the problem solved
in the step is so important and common that it appears repeatedly at many other places; for example, many
equations of robotics algorithms are in homogeneous linear recursive form, so the procedure for solving this
problem in parallel can be applied at all these places. The method to parallelize each of these macro steps is designed
separately.

Instead of computing all the six basic robotics algorithms, we synthesize a benchmark algorithm (see Table 2) 
which represents the general structure of the basic robotics parallel algorithms. This benchmark algorithm consists 
of six serial steps, and each step needs the cooperation of n processors. This benchmark algorithm will be used to 
demonstrate the whole process of mapping the "serial type" parallel algorithms onto a proposed parallel architecture 
in the following sections. 

4. Design of Algorithmically-Specialized Parallel Architecture 

In this section, an appropriate parallel architecture with the attributes that best match the common features of 
the six basic robotics parallel algorithms is designed. The important parallel architecture attributes include the type 
of machine (e.g., SIMD or MIMD mode), number of processors, synchronization requirement, processor
capabilities, memory organization, and network requirement. Each of these attributes is affected by one or more features of
the six basic robotics algorithms discussed in section 3. Detailed consideration for the design of this machine can be 
found in [2]. With all these requirements and attributes, the appropriate parallel architecture is a reconfigurable, 
dual-network, SIMD (DN-SIMD) machine for the computation of robotic algorithms. 

The structure of the proposed DN-SIMD machine, as shown in Fig. 1, consists of multiple processing ele- 
ments, two reconfigurable interconnection networks (RIN1 and RIN2), a set of global data registers (GDRs), three 
data buffers including register output buffer, PE output buffer and input data buffer (IDB), and a set of multiplexers. 
All of these are coordinated by a central control unit (CU) which is not shown in Fig. 1. The functions of each ele- 
ment are briefly described here. 

1. Processing Element (PE). There are n identical PEs. Each PE is essentially an arithmetic logic unit (ALU) 
with attached working registers (see Fig. 1). All the ALUs perform the same programmable function
synchronously in a lock-step fashion under the command of the CU. Some of the PEs can be masked (disabled) for
some computation period, while other unmasked or enabled PEs perform computations. Each PE has two input
working registers (IWRs) which are used to store two operands for each computation, and one output working
register (OWR) which is used to store the current result of each computation. The operands in the IWRs are
kept there until they are replaced. Thus, they can be used repetitively if one or two operands are common for a
series of continuous computations. An inner loop connection within a PE is designed, which connects the 
OWR to one of the two IWRs. This provides an immediate inner-PE forwarding path such that the current 
result can be used as an operand for the next computation immediately. 

2. Global data registers (GDRs). There are n groups of data registers which correspond to the n global memory 
modules. In each computation period, the registers with the same relative position in each group can be 
accessed under the control of the CU. The result of each computation from each PE will be stored in the GDRs 
only when either the result is the final output or the result will be used in later computations but not the 
immediate following one, which can make use of the internal forwarding path for data exchange among PEs or 
inner loop within PEs. 

3. Reconfigurable interconnection networks. There are two sets of identical interconnection networks: RIN1 and 
RIN2. They are assumed to have full connectivity including one-to-one, permutation, and broadcast capabili- 
ties (e.g., the crossbar network). The RIN1 connects the GDRs to the PEs. This provides the paths for sending 
required operands to the appropriate PEs. The RIN2 makes the connection from the outputs of PEs to the 
inputs of PEs; this provides the direct paths for internal forwarding data exchange among PEs. It should be 
noted that, if necessary, the output of PE i can be stored into its corresponding memory module i. This is not 
affected by the RIN2. 

4. Data buffers. There are three sets of data buffers. The register output buffer allows the “current computa- 
tion” and the “RIN1 reconfiguration and operand fetch for the next computation” be processed at the same 
time. The PE output buffer allows the “current computation” and the “RIN2 reconfiguration and output data 
storing” be processed simultaneously. The input data buffer (IDB) is the buffer for operands directly from 
external input data. 

5. Multiplexer. The n multiplexers in front of the n PEs are used to select proper operands to enter PEs from
three possible sources: GDR, IDB, and internal forwarding data (IFD) exchange. They are also under the control of the control unit.

With the functions of these elements described above, the basic mathematical operations performed by each
PE of the DN-SIMD machine involve at most two operands:

$$T = A \circ B \qquad (3)$$

where A and B are two arbitrary operands, each of which can be a scalar, vector, or matrix, and $\circ$ indicates the operation
performed by the PE. When either A or B is null, the computation only involves one operand such as the transpose 
of a matrix. The operands A or B may come from five different sources. They are GDRs through RIN1, IFD 
exchange through RIN2, IDB, IWR within PE, and OWR within PE through inner loop connection. The result T 
may be sent to two possible destinations: GDRs directly, or PEs through RIN2 via IFD exchange. The possible
input operands and output result transfer path diagrams are illustrated in Figs. 2(a) and 2(b) respectively. In Fig. 
2(a), we demonstrate all the possible source combinations except the case that the operands come from the IWR 
within the PE. We assume that the time to transfer one operand from the GDR or the IDB to a PE (i.e., operand 
fetching) is the same as the time to transfer the output result from a PE to the GDR (i.e., result storing) and is equal to
the computation time of one basic PE operation. This time interval is called a cycle. Since operand fetching,
computation, and result storing can be performed simultaneously due to the data buffers designed in this system, a
three-stage pipelined operation can be performed on our DN-SIMD machine. A computation usually needs
two operands A and B. If A and B come from different sources, they can be transferred to a PE simultaneously
in one cycle. In this case, the three-stage pipelined operation proceeds normally. However, if A and B come
from the same source (e.g., GDR or IDB), then it will take 2 cycles to transfer them. This situation is called the 
double transmission required (DTR) computation. In this case, a delay period must be added to the pipeline opera- 
tion to synchronize the operation. This DTR computation obviously will slow down the system speed. Hence, we 
need to minimize the number of DTR computations in a computational task. 
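
The cost of DTR computations can be illustrated with a simple cycle count. The model below is an assumption made purely for illustration: each subtask passes through fetch, compute, and store stages of one cycle each, and every DTR subtask inserts exactly one extra delay cycle. The source labels (GDR, IDB, IFD, IWR, OWR) follow the text; treating only GDR and IDB as DTR-causing sources, the function names, and the example sequence are hypothetical choices of this sketch.

```python
# Sources the text names as examples that need two transfers when both operands use them.
# Whether other sources behave the same way is not assumed here.
SINGLE_PATH_SOURCES = {"GDR", "IDB"}

def is_dtr(src_a, src_b):
    """True if both operands come from the same single-path source."""
    return src_a == src_b and src_a in SINGLE_PATH_SOURCES

def pipeline_cycles(subtasks):
    """Cycle count for a sequence of subtasks under the simplified model.

    subtasks: list of (source of A, source of B); B is None for one-operand operations.
    """
    if not subtasks:
        return 0
    dtr = sum(1 for a, b in subtasks if b is not None and is_dtr(a, b))
    # m subtasks need m + 2 cycles in a full three-stage pipeline, plus one stall per DTR
    return len(subtasks) + 2 + dtr

schedule = [
    ("GDR", "IDB"),   # different sources: no delay
    ("GDR", "GDR"),   # same source: DTR
    ("OWR", "GDR"),   # previous result reused through the inner loop: no delay
    ("IDB", "IDB"),   # same source: DTR
    ("GDR", None),    # one-operand operation
]
total = pipeline_cycles(schedule)
```

Under this model the five subtasks take 9 cycles rather than 7, which is the kind of overhead a reordering of subtasks aims to remove.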

5. Mapping of Parallel Robotic Algorithms onto the Dual-Network SIMD Machine 

Since our DN-SIMD machine was designed to best match the common characteristics of the six basic robotics 
parallel algorithms, the scheduling of their computations in our system is more straightforward and poses fewer difficulties
than other general mapping problems. Based on this matching of characteristics, a systematic and
efficient mapping procedure is developed to map the parallel robotic algorithms onto the proposed medium-grained 
DN-SIMD machine. 

The proposed mapping procedure consists of three stages [2]. In the first stage, each of the single steps of 
these parallel robotic algorithms is further decomposed into a set of “subtasks” and each subtask possesses the basic
mathematical form involving at most two operands. On the other hand, each of the macro steps in these
algorithms is viewed as a subtask and is not decomposed at this stage. The first stage results in a series of parallel
subtasks. In the second stage, these subtasks are reordered to reduce the number of DTR operations through a
neighborhood scheduling algorithm. The reordered subtasks are mapped onto the DN-SIMD machine directly in the
third stage. In this final stage, the actual implementation of the macro steps in the parallel algorithms on the
DN-SIMD machine is also performed. Using the benchmark algorithm in Table 2 as an example, the details of these three
stages of our mapping procedure are discussed in the following subsections. 

5.1. Subtask Assignment 

Since the proposed DN-SIMD is a medium-grained machine and is synchronized at each basic mathematical 
operation, each parallel algorithm must be decomposed into a series of subtasks. Each subtask is either in the basic 
mathematical form, which involves at most two operands, or a well-defined macro step. Although this functional 
decomposition is easily performed on the single steps, this is not the case for the macro steps, whose data 
dependencies are so complex that a decomposition based on the basic computational unit is not obviously feasible. 
The macro step is therefore viewed as a single subtask in this stage. Consider the decomposition of the following 
equation: 

$K = L \times (C + E) + G \times C$ . (4) 

Here we use three temporary variables, $T_1$, $T_2$, and $T_3$, to rewrite Eq. (4) into four simple equations in 
the basic mathematical form: 

$T_1 = C + E$ , $T_2 = L \times T_1$ , $T_3 = G \times C$ , and $K = T_2 + T_3$ . (5) 
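
The decomposition of Eq. (4) into Eq. (5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `decompose` and the tuple layout are hypothetical, and `*` stands in symbolically for the product written as × above.

```python
import ast

def decompose(expr, result):
    """Split an expression into subtasks that each have at most two operands.

    Returns a list of (target, left, operator, right) tuples in evaluation order.
    Temporaries are named T1, T2, ...; the last subtask writes to `result`.
    """
    ops = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}
    subtasks = []
    counter = 0

    def visit(node, target=None):
        nonlocal counter
        if isinstance(node, ast.Name):
            return node.id                      # plain variable: nothing to compute
        if isinstance(node, ast.BinOp):
            left = visit(node.left)             # decompose operands first
            right = visit(node.right)
            if target is None:                  # intermediate result: new temporary
                counter += 1
                target = f"T{counter}"
            subtasks.append((target, left, ops[type(node.op)], right))
            return target
        raise ValueError("unsupported expression")

    visit(ast.parse(expr, mode="eval").body, target=result)
    return subtasks

for target, left, op, right in decompose("L*(C+E) + G*C", "K"):
    print(f"{target} = {left} {op} {right}")
```

Each emitted tuple corresponds to one subtask in the basic two-operand form; macro steps would be kept as single entries rather than passed through such a routine.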

This same technique is applied to our decomposition process for single steps. For clarity, the benchmark 
algorithm is used as an example to demonstrate the technique. The decomposition result and the original algorithm 
are shown in Table 2. Here, two sets of variables are introduced. The $T_i$'s represent the intermediate results 
(temporary variables) or the final outputs; if $T_i$ is a macro subtask, it is specially marked in the table. The 
$I_i$'s represent the external input variables; that is, the variables that do not come from the outputs of other 
computations. 

To ease the subtask scheduling in the second stage, notation simplification is performed on the above task 
table to produce a simplified task table as shown in Table 3. In this table, two arrays are defined: TB[i] contains 
the identification of the subtasks $T_i$ and OP[i] represents the corresponding operation for subtask TB[i]. Each 
element of OP[i] is either a macro subtask or of the form $A \circ B$, where $A$ and $B$ may be $T_j$ ($T_j^a$) or 
$I_j$. Moreover, the superscript on $A$ or $B$ indicates the difference between the index $i$ of the result, 
$T_{j_1}[i]$, and the index $k$ or $l$ of its operand $T_{j_2}[k]$ or $T_{j_3}[l]$, where 
$T_{j_1}[i] = T_{j_2}[k] \circ T_{j_3}[l]$. For example, the subtask $T_1[i] = T_2[i+2] \circ T_3[i-1]$ is denoted 
as $T_1 = T_2^{2} \circ T_3^{-1}$. If the indices are equal, that is, $i = k$ or $i = l$, then the superscript is 
omitted. For example, the subtask $T_4[i] = T_5[i] \circ T_6[i]$ is denoted simply as $T_4 = T_5 \circ T_6$. The 
simplified task table is the final result of this stage and will be used as the input for the next stage. 
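
As a small illustration of the simplified notation, the following sketch formats one entry of the task table from the index offsets of its operands. The function name `simplified_entry` and the `(name, offset)` pairs are hypothetical, and the offset is assumed to be the operand index minus the result index, as in the example above.

```python
def simplified_entry(result, left, op, right):
    """Format one row of the simplified task table.

    Each operand is a (name, offset) pair, where offset is the operand index
    minus the result index (assumed sign convention). A zero offset is omitted.
    """
    def fmt(operand):
        name, offset = operand
        return name if offset == 0 else f"{name}^{offset}"

    return f"{result} = {fmt(left)} {op} {fmt(right)}"

# T1[i] = T2[i+2] o T3[i-1]
print(simplified_entry("T1", ("T2", 2), "o", ("T3", -1)))
# T4[i] = T5[i] o T6[i]
print(simplified_entry("T4", ("T5", 0), "o", ("T6", 0)))
```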

5.2. Subtask Scheduling 

To schedule the subtasks for computation, we first observe all the possible operand sources and their 
combinations for each computation. An operand may be one of four possible types, denoted $S_I$, $S_{OI}$, $S_T$, and 
$S_{OT}$, which correspond to four different kinds of sources. $S_I$ denotes an operand from the IDB; it needs one 
period of transmission time. $S_{OI}$ denotes an operand that was fetched from the IDB by the previous computation 
(subtask) and is still in the IWR within the PE, so no transmission is required for it. $S_T$ denotes an operand 
from the GDR; it requires one cycle of transmission time via the network RIN1. $S_{OT}$ denotes an operand from 
other sources, which covers the following three possibilities: (i) the operand was fetched from the GDR by the 
previous computation and is still in the IWR within the PE, so no transmission time is required; (ii) the current 
computation result, passed through the inner loop; (iii) the current computation result, passed through the 
internal forwarding path with data exchange provided by the network RIN2. The transmission time for the last two 
cases is negligible compared with the system cycle time. Using these notations, all the possible combinations of 
operand sources, including the situation of only one operand, are listed below: 


(S', , S",) 

(S , , S T ) 

(Sot * St) 

(So, , So,) 

(S,) 

(S , , S 0 f) 

So t > Sot) 

(S't , S" T ) 

(Sot • Sot) 

(So,) 

(S , , Sot) 

(So , . Sj) 

(S, rS,) 

(Sj , St) 

(St) . (Sot) 


where the prime superscripts are used to distinguish different operands from the same kind of source. Among these 
situations, the combinations $(S'_I, S''_I)$ and $(S'_T, S''_T)$ are DTR operations and require two cycles to 
transmit two operands through the same transmission path. It is possible to eliminate DTR operations if we reorder 
the processing sequence without violating the precedence constraint. That is, in these two situations, one operand 
$S'_T$ (or $S''_T$) can become the type $S_{OT}$, or $S'_I$ (or $S''_I$) can become the type $S_{OI}$. The DTR 
operation is then avoided, and so is the unnecessary transmission, because the same data are reused immediately. 
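
The cycle counts implied by the four source types can be summarized in a short sketch. The names `transmission_cycles` and `is_dtr_operation` and the string codes for the source types are hypothetical. The sketch assumes, as stated in the text, that only $S_I$ and $S_T$ operands need a transfer, and that an IDB transfer and a RIN1 transfer can proceed in the same cycle because they use different connection lines.

```python
# Source types: "I"  = from the IDB (one transfer),
#               "OI" = already in the IWR, fetched earlier from the IDB,
#               "T"  = from the GDR via RIN1 (one transfer),
#               "OT" = other sources (already in the IWR from the GDR, or fed back).
PATH = {"I": "IDB", "T": "RIN1"}          # path used by the types that need a transfer
VALID = {"I", "OI", "T", "OT"}

def transmission_cycles(*sources):
    """Cycles needed to bring the operands of one subtask into the PE."""
    if not sources or len(sources) > 2 or any(s not in VALID for s in sources):
        raise ValueError("expected one or two operands of type I, OI, T or OT")
    paths = [PATH[s] for s in sources if s in PATH]
    # Operands on different paths move in the same cycle;
    # two operands on the same path must be sent one after the other.
    return max((paths.count(p) for p in set(paths)), default=0)

def is_dtr_operation(*sources):
    """True when two operands compete for the same transmission path."""
    return transmission_cycles(*sources) == 2

print(transmission_cycles("I", "I"), transmission_cycles("I", "T"),
      transmission_cycles("OI", "OT"))
```

Reordering the subtasks aims to turn one operand of an (I, I) or (T, T) pair into the OI or OT type, which brings the count down from two cycles to one.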

A neighborhood scheduling algorithm for scheduling and reordering the execution of these subtasks to minim- 
ize the total number of DTR operations has been developed and is considered here. 

Definition 1. For two subtasks in the kth and lth rows of the simplified task table, TB[k] and TB[l], assume 
OP[k] = $A \circ B$ and OP[l] = $C \circ D$, where $A$, $B$, $C$, and $D$ are operands, each of one of these 
possible types: {$I_j$, $T_j$, $T_j^a$}. Then the subtask TB[k] is called a neighborhood of TB[l] if all the 
following conditions are satisfied: 

(i) $k < l$, 

(ii) $C$ = TB[k] or $C$ = TB$^a$[k] or $C = A$ or $C = B$ or 
$D$ = TB[k] or $D$ = TB$^a$[k] or $D = A$ or $D = B$. 

From the above definition, if subtask TB[l] has a previous subtask TB[k] as its neighborhood ($k < l$) and 
these two subtasks are next to each other, i.e., $l = k + 1$, then at least one operand of subtask TB[l] comes 
directly from the result or an operand of subtask TB[k] without accessing the GDR or the IDB. This saves the 
communication time needed to access global memories, and the subtask TB[l] will never be a DTR subtask, which 
reduces the number of DTR subtasks. 

Definition 2. A subtask in the kth row of the simplified task table, TB[k], is called a double transmission 
required (DTR) subtask if the following two conditions are satisfied: 

(i) Its operation has one of these forms: 

OP[k] = TB[m] $\circ$ TB[n] for some $m, n < k$ and $m \ne n$. 

OP[k] = TB$^a$[m] $\circ$ TB[n] for some $m, n < k$ and $m \ne n$. 

OP[k] = TB[m] $\circ$ TB$^b$[n] for some $m, n < k$ and $m \ne n$. 

OP[k] = TB$^a$[m] $\circ$ TB$^b$[n] for some $m, n < k$ and $m \ne n$. 

OP[k] = I[m] $\circ$ I[n] for $m \ne n$. 

(ii) $k = 1$, or TB[k-1] is not a neighborhood of TB[k] for $k > 1$. 

Notice that for OP[k] = TB[m] $\circ$ I[n] and OP[k] = TB$^a$[m] $\circ$ I[n], the subtask TB[k] is not a DTR 
subtask because its two operands can be transmitted simultaneously through two different sets of connection lines. 
Moreover, a subtask that involves only one operand is obviously a non-DTR subtask. For example, in the simplified 
task table of the benchmark algorithm, subtasks $T_7$, $T_9$, $T_{12}$, $T_{13}$, $T_{15}$, and $T_{17}$ are all DTR 
subtasks as indicated in Table 3. 

From the above definition, whether a subtask is a DTR subtask depends on its “position” in the simplified 
task table. A DTR subtask can become a non-DTR subtask if it is moved to the position immediately after its 
neighborhood. Since moving a DTR subtask may introduce a new DTR subtask, this reordering is desirable only when it 
complies with the precedence constraint of the original algorithm and the number of DTR subtasks in the reordered 
task table is less than that in the original table. This forms the scheduling problem; that is, to reorder the 
processing sequence of subtasks so as to reduce the number of DTR subtasks as far as possible without violating 
the precedence constraint of the original algorithm. This reordering can be performed by the following 
neighborhood scheduling algorithm. 

Algorithm N-Scheduling (Neighborhood Scheduling Algorithm). 

Input: Simplified Task Table with n rows (i.e., n subtasks). 

Output: Reordered Task Table. 

N1. [Main Loop] Check each subtask to see if it is a DTR subtask. If yes, try to change its position. 

    For k = 1 step 1 until n do 

N2. [Check DTR] 

    Check whether TB[k] is a DTR subtask according to Definition 2. If not, go to step N4. 

N3. [Main Body] Try to change the position of a DTR subtask to make it into a non-DTR subtask. 

    If OP[k] = (TB[m] or TB^a[m]) o (TB[n] or TB^b[n]), 
        then let i <- max(m, n); 
        else let i <- 1; {* OP[k] = I[m] o I[n] *} 
    End {If} 

    While i < k-1 do 
        If TB[i] is a neighborhood of TB[k], then 
            If {TB[i+1] is a DTR subtask} or {the insertion of TB[k] between TB[i] and 
               TB[i+1] will not make TB[i+1] a DTR subtask}, 
                then insert TB[k] behind TB[i] to make TB[k] the new (i+1)th subtask; 
                     go to step N4 
            End {If} 
        End {If} 
        Let i <- i+1; 
    End {While} 

N4. Continue {main loop} 

    End {For} 

END. {N-Scheduling} 

As an example, the N-Scheduling algorithm is applied to the benchmark algorithm. The input is the 
simplified task table in Table 3, which has a total of 18 subtasks, six of which are DTR subtasks. Applying the 
N-Scheduling algorithm to this simplified task table produces the reordered task table, also shown in Table 3, in 
which none of the subtasks is a DTR subtask any longer. 
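
The following Python sketch illustrates Definitions 1 and 2 and the N-Scheduling loop on a small made-up task table (not the benchmark of Table 3). It rests on several assumptions that are not in the original: each subtask is a `(name, operands)` tuple, superscripts are dropped from operand names, operand names start with "T" or "I", a subtask that uses its own name as an operand is treated as a macro (recurrence) subtask and never as DTR, and rows are indexed from 0. All function names are hypothetical.

```python
def is_neighborhood(prev, cur):
    """Definition 1: cur uses prev's result or shares one of prev's operands."""
    prev_name, prev_ops = prev
    _, cur_ops = cur
    return any(c == prev_name or c in prev_ops for c in cur_ops)

def is_dtr(table, k):
    """Definition 2 for row k (0-based) of the task table."""
    name, ops = table[k]
    if len(ops) < 2 or ops[0] == ops[1] or name in ops:
        return False                      # single operand, same operand, or macro
    if {op[0] for op in ops} != {"T"} and {op[0] for op in ops} != {"I"}:
        return False                      # one from the GDR, one from the IDB
    return k == 0 or not is_neighborhood(table[k - 1], table[k])

def count_dtr(table):
    return sum(is_dtr(table, k) for k in range(len(table)))

def n_schedule(table):
    """Move each DTR subtask right behind an earlier neighborhood when allowed."""
    table = list(table)
    for k in range(len(table)):
        if not is_dtr(table, k):
            continue
        cur = table[k]
        names = [name for name, _ in table]
        producers = [names.index(op) for op in cur[1] if op in names]
        i = max(producers) if producers else 0   # never move above a producer
        while i < k - 1:
            if is_neighborhood(table[i], cur):
                trial = table[:i + 1] + [cur] + table[i + 1:k] + table[k + 1:]
                # accept if the displaced row was DTR anyway or does not become DTR
                if is_dtr(table, i + 1) or not is_dtr(trial, i + 2):
                    table = trial
                    break
            i += 1
    return table

toy = [("T1", ("I1",)), ("T2", ("T1", "I2")), ("T3", ("T1", "I3")),
       ("T4", ("T1", "I4")), ("T5", ("T2", "T3")), ("T6", ("T5", "I5"))]
reordered = n_schedule(toy)
print([name for name, _ in reordered], count_dtr(toy), count_dtr(reordered))
```

In this toy table, T5 needs two operands from the GDR and does not follow a neighborhood, so it is a DTR subtask; moving it directly behind T3 lets one operand stay inside the PE.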

5.3. Mapping Procedure 

The reordered task table produced by the N-Scheduling algorithm can be mapped onto the proposed DN-SIMD 
machine in a rather straightforward way as long as its subtasks are single-step, simple subtasks; macro steps 
require further consideration. Our mapping procedure at this final stage therefore consists of two phases. In the 
first phase, all subtasks, including the macro steps, which are temporarily viewed as single steps, are mapped 
directly onto the DN-SIMD machine one after another. The actual mapping of the macro steps is considered in the 
second phase. The output of the mapping procedure is a control table as shown in Table 4. This table consists of 
ten columns and specifies the exact sequence of actions of the central control unit. The first column gives the 
identification of the subtasks in processing order; it also represents the result of the corresponding subtask. 
The second column indicates the first operand, which may be a $T_j$ (possibly in macro or superscripted form) or an 
$I_j$ for some $j$. The third column indicates the source of the first operand, for which there are five 
possibilities: the GDR, the IDB, the IFD, and the IWR and the OWR within the PE. The fourth column describes which 
network is used (RIN1 or RIN2) and the connection type required on it to transmit the first operand, if necessary. 
Columns 5 to 7 contain the same information as the previous three columns, but for the second operand, if it 
exists. Column 8 indicates the operation performed in this subtask. Column 9 indicates the destination of the 
result; it may be the GDR, the IFD, or both. If the IFD is needed, the connection type of network RIN2 is 
specified. Column 10 contains comments on this subtask. For a macro subtask, these columns have somewhat different 
meanings. Columns 2-7 give the corresponding information for the initial conditions of the macro subtask (similar 
to the parameters of a subroutine in a serial program). Columns 9-10 give the corresponding information for the 
final result of the macro subtask (similar to the return values of a subroutine in a serial program). 

At the end of phase 1 of the mapping procedure, the control table of the benchmark algorithm is obtained as 
shown in Table 4. Since there are three macro subtasks in the control table, further mapping must be performed in 
phase 2. Among these macro subtasks, $T_1$ and $T_6$ are HLR equations, and $T_{11}$ is an HHLR equation. The 
mapping of HLR equations is demonstrated next. 

The first-order homogeneous linear recurrence equation is defined as follows: given $x(0) = a(0) = \text{null}$ and 
$a(i)$, $1 \le i \le n$, find all the $x(i)$ for $1 \le i \le n$ from the recursive equation 

$x(i) = x(i-1) \circ a(i)$ . (6) 

The recursive doubling technique solves this recursive equation efficiently on an SIMD machine [7,19]. Using this 
technique, the parallel algorithm to solve Eq. (6) and the mapping diagram of this algorithm onto the proposed 
DN-SIMD machine are shown in Fig. 3. This diagram carries the same information as a control table, including the 
sources of operands, the destination of the result, the network used, and the required connection types for each 
iteration. It takes $O(\lceil \log_2 (n+1) \rceil)$ iterations to produce the final results. Notice also that, in 
Fig. 3, we assume that the initial conditions $a(i)$ come from the IDB. In fact, they may also come from the GDR, 
depending on whether the $a(i)$ are external input variables or not. In that case, the mapping diagram is exactly 
the same except that the $a(i)$ come from the GDR through the network RIN1 at the beginning. Similarly, the final 
results $x(i)$ can be stored in the GDR or fed back directly to the PEs, depending on the needs of the next 
subtask. Using similar techniques, the mapping of HHLR equations can also be performed [2]. 
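
A sequential Python sketch of recursive doubling for Eq. (6) is given below for illustration. It assumes that the operation is associative and represents the null initial term by `None`; the function name `recursive_doubling` is hypothetical. Each pass of the loop corresponds to one lock-step iteration in which every PE combines its value with the value held $2^{k-1}$ positions earlier.

```python
from math import ceil, log2

def recursive_doubling(a, op):
    """Return [x(1), ..., x(n)] with x(i) = x(i-1) op a(i) and x(1) = a(1).

    `a` holds a(1..n); `op` must be associative. The null term x(0) = a(0)
    is stored as None and treated as "nothing to combine".
    """
    n = len(a)
    x = [None] + list(a)                  # index 0 holds the null term
    s = ceil(log2(n + 1))                 # number of doubling iterations
    for k in range(1, s + 1):
        d = 2 ** (k - 1)
        # All positions are updated from the previous iteration's values,
        # which mimics the simultaneous update performed by the PEs.
        x = [x[i] if i < d or x[i - d] is None else op(x[i - d], x[i])
             for i in range(n + 1)]
    return x[1:]

print(recursive_doubling([1, 2, 3, 4, 5], lambda u, v: u + v))
print(recursive_doubling(["a", "b", "c"], lambda u, v: u + v))
```

The second call uses string concatenation, which is associative but not commutative, to show that the operand order of Eq. (6) is preserved.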

6. Conclusions 

To design a global architecture for a set of parallel robotics algorithms, the characteristics of these algorithms 
are identified according to six fundamental features: type of parallelism, degree of parallelism, uniformity of 
operations, fundamental operations, data dependency, and communication requirements. Based on the characteristics 
matching between the common features of the robotics algorithms and the architecture features, a medium-grained 
DN-SIMD machine is designed. It consists of two sets of reconfigurable interconnection networks. One provides the 
communication between the PEs and the GDRs. The other provides the internal direct feedback paths among PEs to 
avoid unnecessary data storing and routing time. This machine performs three-stage pipelined operations and is 
synchronized at each basic mathematical calculation. 

With the parallel robotics algorithms and the proposed DN-SIMD parallel machine, a systematic mapping 
procedure to schedule the subtasks of the parallel algorithms onto the parallel architecture is developed. This map- 
ping procedure consists of three stages. At the first stage, mathematical decomposition is performed on the parallel 
algorithms to achieve a series of subtasks and each subtask is either in the basic mathematical form which involves 
at most two operands, or a well-structured macro subtask such as the linear recurrence equations. At the second 
stage, to shorten the communication time, the processing sequence of subtasks is reordered to minimize the total 
number of DTR subtasks using the Neighborhood Scheduling algorithm. At the final stage, the reordered subtasks 
are mapped onto the DN-SIMD machine. In this process, the single-step subtasks can be mapped directly, while the 
macro-step subtasks need further design and special technique such as the recursive doubling technique for solving 
the linear recurrence equations. A benchmark algorithm was used throughout as an example to illustrate the map- 
ping procedure. 
Table 1. Characteristics of Basic Robotics Algorithms. 


CHARACTERISTICS 

Communication 

Requirement 

(Regular) 

Permutation 

one-to-one 

Permutation 

Broadcast 

(Regular) 

Permutation 

(Irregular) 

Permutation 

Broadcast 

Permutation 

Broadcast 

Permutation 

Broadcast 

one-to-one 

Broadcast 

one-to-one 

Permutation 

Broadcast 

Data 

Dependency 

HLR 

HLR.HHLR 
SHLR, PNE 
System of Linear Eqs. 

HLR 

HLR (Forward 
& Backward) 

Global 

Local 

Global 

Local 

Fundamental 

Operations 

Matrix-Vector 

Scalar ops. 
Reciprocal 
Matrix-Vector 

Matrix Mult 
Trigonometric 

Matrix-Vector 

Scalar ops. 
Reciprocal 
Vector ops. 

Scalar ops. 
Reciprocal 
Vector ops. 

Scalar ops. 
Reciprocal 
Square root 
Trigonometric 

Scalar ops. 
Reciprocal 
Matrix-Vector 
Trigonometric 

Uniformity of 
Operations 

Yes 

Yes 

Yes 

Yes 

Yes 

Yes 

No 

Yes 

Degree of 
Parallelism 

Large grain 

Large grain 

Medium or 
Fine grain 

it 

11 

Medium or 
Fine grain 

Medium or 
Fine grain 

Fine grain 

Medium or 
Fine grain 

Type of 
Parallelism 

Job level 

Job level 

Job level 

Job level 

Job level 

Job level 

Task level 

Job level 


Algorithms 

Inverse 

Dynamics 

Forward 

Dynamics 

Forward 

Kinematics 

Forward 

Jacobian 

Inverse 

Jacobian 

(Direct) 

Inverse 

Jacobian 

(iterative) 

Inverse 

Kinematics 

(Direct) 

Inverse 

Kinematics 

(iterative) 
Table 2. Robotics Benchmark Algorithm and Subtask Assignment. 

In this table, unless otherwise indicated, i runs from 1 to n. 

B[i], D[i], F[i], H[i], J[i], L[i], N[i]; i = 1, ..., n, are assumed to be input variables. 
B[i] is a 3x3 matrix. D[i], F[i] are 3x1 vectors. 

H[i], J[i], L[i], N[i] are all scalars. 












































Table 3. Simplified and Reordered Task Table of Benchmark Algorithm. 



SIMPLIFIED TASK TABLE 

REORDERED TASK TABLE 

ROW 

TB[ROW] 

OP[ROW] 

DTR 

TBfROW] 

OP[ROW] 

DTR 

1 

7 i 



7, 

7T‘x/i 


2 

t 2 

7 1*2 


7- 2 

fi / 2 


3 

t 3 



7-3 

f t / 3 


4 

t 4 

7 - 3/4 


7-7 

7 2 +7 3 


5 

t 5 

^/S 


7-4 

T 3 I 4 


6 

T 6 

T 6 '+Ts 


Ts 

7 - 4/5 


7 

7 7 

r 2 +r 3 

X 

T 6 

7? +7-5 


8 

7* 

7 - 7/6 


mm 

f i*f« 


9 

t 9 

ffi+7-2 

X 

7-9 

t 6 +t 2 


10 

Tio 

T t +T 9 


7-* 

7 - 7/6 


11 

7„ 

ItTh+Tiq 


Tn 

7*8+79 


12 

T\2 

T 6 +Ttl 

X 

Tn 

f«+7ft 


13 

■m 

T?+fu 

X 

Tn 

ilflPwl 


14 

7, 4 

T n+T 13 


T l7 

77 l *7n 


15 

7, 5 

7 i*f« 

X 

Tn 

77 1 +7n 


16 

T 16 

771*7-15 


I'M 

7 12+7 13 


17 

Txi 

n'-fu 

X 

Tn 

77l*7i5 


18 

Tn 

7 1 16+7" 17 


Tn 

7 16 + 7 17 



where $T_1$ and $T_6$ are HLR equations and $T_{11}$ is an HHLR equation. 

Table 4. Control Table for Benchmark Algorithm. 


o 

i-H 

Comment 

* HLREqn. 


1 



1 

* HLR Eqn. 






* HHLR Eqn. 





Result 


Output Destination 

RIN2 








1 

• 

1 


1 

• 

t 

i 

cn 

- 

t 

IFD 











X 





X 



GDR 

X 

X 

X 

X 




X 

X 


X 

X 


X 




X 

oo 

Operation 

MM 

MV 

MV 

VA 

AS 

AS 

VA 

MV 

VA 

AS 

VA 

VA 

* 

> 

VA 

VA 

> 

VS 

B 

Network 

Type 

1 

1 

1 

RIN1-1 

1 

1 

1 

RINl-1 

RINl-1 

l 

RINl-1 

RINl-1 

* 

RINl-3 

RIN1-4 

RINl-1 

RINl-1 

1 

5 

04 

\D 

Source 

2 

IDB 

IDB 

IDB 

GDR 

IDB 

IDB 

OWR 

GDR 

GDR 

IDB 

GDR 

GDR 

* 

GDR 

GDR 

GDR 

GDR 

GDR 

m 

Operand 

2 

c 

CM 

cn 

B 

r 

TT 

I 

hi 

IK 

CM 

hi 

O 

h 

v© 

IK 

* 



CM 

hT 

»n 

h 

f- 

h 

B 

Network 

Type 

1 

1 

1 

1 

I 

I 

* 

I 

i 

RINl-1 

i 

RIN2-2 | 

i 

i 

i 

t 

RIN2-3 

i 

cn 

Source 

1 

IFD 

OWR 

IWR 

OWR 

IWR 

OWR 

IFD 

OWR 

IWR 

GDR 

OWR 

IFD 

IDB 

OWR 

IWR 

OWR 

IFD 

OWR 

<N 

Operand 

1 

Tk 

IK 

»hi 

cn 

hi 

cn 

hi 


B 

o 

IK 

IK 

r- 

hi 

00 

h 

01 r 


IK 

IK 

cn 

h 

E 

VO 

h 

f—4 

hT 


K 


K 

SI 


vO 

h 

hT 

K 

00 

hi 

o 

K 

fS 

K 

IK 

r- 

h 

cn 

K 

■» 

K 

'O 

K 

00 

hT 
Connection type 1: straight connection; Connection type 3: uniform module shift (d = 1) 

Connection type 2: uniform module shift (d = 2); Connection type 4: uniform module shift (d = -1) 

















































Figure 1. (a) Structure of Dual Network SIMD Machine. (b) The Structure of Processing Element. 
GDR: Global Data Register; ALU: Arithmetic Logic Unit; IR: Input Working Register; OR: Output Working Register. 

















Figure 2. Data Path Flow of Processing Element. (a) Input Operands Flow Diagrams. (b) Output Result Flow Diagrams. 
Algorithm FOHRA (First-Order Homogeneous Recurrence Algorithm). 

F1. [Initialization] Given the terms $a_i$, $0 \le i \le n$, let $X^{(k)}(i)$ be the ith sequence at the kth 
    splitting and $s = \lceil \log_2 (n+1) \rceil$. Set the sequence at the initial step, 
    $X^{(0)}(i) \leftarrow a_i$, $0 \le i \le n$. 

F2. [Compute $x_i$ parallelly] 
    for k <- 1 to s, do 

        $X^{(k)}(i) = X^{(k-1)}(i - 2^{k-1}) \circ X^{(k-1)}(i)$ , if $2^{k-1} \le i \le n$ 

        $X^{(k)}(i) = X^{(k-1)}(i)$ , if $0 \le i < 2^{k-1}$ 

    end {for} 

    Set $x_i \leftarrow X^{(s)}(i)$, $1 \le i \le n$. 

END FOHRA. 



Figure 3. Mapping Diagram of First-Order Homogeneous Linear Recurrence Equations on 
DN-SIMD Machine. 